Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New caching and formatting functionality #634

Merged
merged 79 commits into from
Jul 4, 2022

Conversation

roshanshariff
Copy link
Collaborator

@roshanshariff roshanshariff commented Jun 13, 2022

This pull request continues the work in #628.

Public API

This is a summary of all the Citar public functions

Bibliography data

  • citar-get-entry: (KEY)
  • citar-get-entries: ()
  • citar-get-value: (FIELD KEY-OR-ENTRY)
  • citar-get-field-with-value: (FIELDS KEY-OR-ENTRY): Returns (FIELD . VALUE), the first existing field among FIELDS.
  • citar-get-display-value: (FIELDS KEY-OR-ENTRY)

Associated files, notes, and links

These functions take an optional keyword argument :entries, which defaults to calling (citar-get-entries). If this argument is provided, they won't access the cache. They also look for resources through keys found in the crossref field.

  • citar-get-files: (KEY-OR-KEYS)
  • citar-get-links: (KEY-OR-KEYS)
  • citar-has-files: ()
  • citar-has-notes: ()
  • citar-has-links: ()

The citar-has- functions return predicates that take a KEY argument and return whether that key has files/notes/links. The returned functions can be used with the :filter argument to citar-select-ref:

(citar-select-ref :filter (citar-has-files))

Selecting references

  • citar-select-refs: (&key multiple filter): Always returns a list of keys, regardless of whether :multiple is nil.
  • citar-select-ref: (&key filter): Always returns a single key.

At-point information

  • citar-key-at-point
  • citar-citation-at-point

Commands

...

Formatting

This is in citar-format.el. Formatting references is now done in three stages:

  1. In the first stage, all required entry fields are looked up and formatted into strings.
  2. In the second stage, these strings are joined together after accounting for the current screen width.
  3. (In citar--ref-completion-table) Metadata like the "has:notes" and "has:files" tags are looked up and added.

The first step is done when the bibliography is parsed, while the second and third are fast enough to do for every completion session.

Behind the scenes, there is new code to parse format strings in the existing format into a new, efficient, list-based format. All the functions now operate on this parsed representation internally. When formatting or pre-formatting entire bibliographies, this parsing only needs to be done once.

Caching

This functionality is in citar-cache.el. Instead of buffer-local and global caches, we now maintain a centralized cache which holds records for each bibliography file in use. This cache also keeps track of which buffers are using each bibliography, so that the cached files can be discarded when they are no longer used by any buffer.

The cache also holds "pre-formatted" candidate strings (i.e. the result of step 1 above). These strings do not depend on any transient information like whether each entry has files/notes or the screen width.

Remaining work

  • Allow the cache to be refreshed manually using citar-refresh. Currently the cache is never updated after the first time each bibliography is parsed. Now, the bibliography is automatically updated whenever you call a citar command and it has changed on disk. If there is demand, we can add a mode where Citar never updates the bibliography unless explicitly requested.
  • Integrate the file-notify functionality so the cache can be updated whenever the underlying bibliography file changes. We are removing file-notify functionality for now.
  • Clean up doc strings.
  • fix select-multiple with org-cite-insert
  • check embark-act both in minibuffer and buffer (there are bugs in both)
  • test performance with very large libraries

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

On the todos:

Integrate the file-notify functionality so the cache can be updated whenever the underlying bibliography file changes.

Do you really mean file-notify?

Or is this necessary to avoid the update when running citar-select-ref?

@roshanshariff
Copy link
Collaborator Author

Right, as it stands the cache never gets updated at all, even when calling citar-select-ref. I should change this so it re-parses the bibliography on calling citar-select-ref if the file has changed. Then we'd hook it up to filenotify so that the update happens automatically and there isn't a big wait on the next completion session.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

First glance looks good.

When you think it's ready to merge to the other branch, mark it ready for review?

I'll test more a bit later.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

Or if easier, I suppose you could rebase on main, and I can close the other.

Up to you.

@roshanshariff
Copy link
Collaborator Author

Sure, I'll rebase on main so we don't have to do another merge dance in the future.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

Except why keep track of the checksums if we need file-notify?

With new cache design, maybe we should turn on file-notify be default?

@roshanshariff roshanshariff changed the base branch from cache--api-refactor to main June 13, 2022 22:36
@roshanshariff
Copy link
Collaborator Author

We don't need file notify, but it's nice to have. File-notify can also report spurious change events in some cases, so the checksum guards against unnecessary updates. And yes, we can turn on file-notify by default now. But we should probably have a fallback for people who can't use file-notify (maybe they're on a network file system or something?)

We'll have to fine-tune the details on this and add appropriate customization options. For example, I can imagine somebody might have a bibliography on a tramp connection and they don't want emacs to try to update it every time they start completion. Even just checking the file size and time could involve a long round-trip they'd rather not have.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

Can you also take a look at the new capf and make sure it works?

I have a problem testing it.

Should just require a slight tweak to work with this.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

I'm going to edit your TODO directly, to keep it all in one place.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 13, 2022

I'll put some general notes here:

First, I tried citar-org-roam and it works without modification! I may move that back here, BTW.

Second, performance:

First, not surprisingly, this version of citar--ref-completion-tables is significantly (~5.5x) faster than citar--format-candidates (and the version in #628):

ELISP> (benchmark-run-compiled 100 (citar--ref-completion-table))
(12.98203636 43 6.759782593000001)

Alas, see my comment below on UX snappiness with large libraries: #634 (comment). In that scenario, not caching the completion candidates means it's slower overall.

Aside: maybe we should change the name to citar--ref-format-completions?

citar.el Outdated Show resolved Hide resolved
citar.el Outdated Show resolved Hide resolved
citar.el Outdated
@@ -179,6 +193,25 @@ All functions that match a particular field are run in order."
:type '(alist :key-type (choice (const t) (repeat string))
:value-type function))

(defcustom citar-prefilter-entries '(nil . t)
Copy link
Contributor

@bdarcus bdarcus Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roshanshariff WDYT of this idea of mine?

This would address #449. Earlier I had closed it because I couldn't figure out a flexible way to do this, but this seems a good approach, that's a practical use for filtering?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to implement a Consult-like narrowing system. I looked at its code, and behind the scenes it maintains a list of predicates that are combined into minibuffer-completion-predicate. We'd bind some keys to allow predicates to be enabled and disabled on-the-fly, just like Consult does.

Then we could make citar-select-ref activate a certain predicate by default, but you'd always be able to press a key to remove the filter and/or add other filters.

Copy link
Contributor

@bdarcus bdarcus Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that would be valuable, without us requiring consult.

But this (really the car) is orthogonal, and is just saying whether to pre-filter with citar-open-notes and such, as per #449.

As in, this and that are complementary.

In any case, feel free to remove or adapt this at will.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 14, 2022

Responsiveness with large libraries

This approach has the major advantage that the "has-" indicators are always in sync. It also seems a clean architecture, with appropriate separation of concerns.

But citar-select-ref is slower, and it is noticeable with large libraries.

Try this 6000+ reference/3.2 MB file.

http://www-alg.ist.hokudai.ac.jp/~thomas/COLTBIB/colt.bib.gz

There is a noticeable pause, akin to having no cache on my ~1000 item bib.

What if we cache the completion candidates too?

This approximates that (I don't include the affixation function, but it doesn't make an obvious difference):

ELISP> (setq my/completions (citar--ref-completion-table))
#<hash-table equal 6934/8311 0x475237>
ELISP> (completing-read "test: " my/completions)

The difference is obvious (though not with my own smaller library).

I guess we'll need to decide what kind of performance users should expect, with what size libraries?

My own view is one shouldn't expect for the citar UI to be highly responsive with such large files.

But this approach and also caching the completions are not mutually exclusive, so we could always add it later if needed?

Context: I'm running these tests on a Lenovo ThinkPad X1 Carbon 7th gen laptop, with i5-8265U processor and 16 GB of RAM. Not the fastest machine, for sure, but no slouch.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 14, 2022

We don't need file notify, but it's nice to have?

Why?

citar-cache.el Outdated Show resolved Hide resolved
@roshanshariff
Copy link
Collaborator Author

But citar-select-ref is slower, and it is noticeable with large libraries.

Thanks for that big file; I will test with it on top of own ~1000 key bibliography. I'm implementing some low-hanging fruit in terms of improving performance that should help.

However, you're right that no matter how hard we try, for large enough libraries performance will be a concern. I'd prefer not to add another cache, however, with all the accompanying headaches. Instead, I think we should expose a few options that gradually make the results less "live" but improve performance. The idea would be to maintain the current architecture, but have the "preformat" stage do more and more work.

So, for example, we currently adapt the completions to the current frame width. We could have a defcustom to specify a fixed width for completion candidates. Then the preformat can produce a single string of that width instead of a list of strings, and there's little work to be done at display-time.

An even more extreme option would make the preformat include the has:note and has:file markers, and they would only be updated when the bibliography changes or you run citar-refresh.

We'd have these options disabled by default, but document them in the Readme as "options for large bibliography files". What do you think? If you enabled all these options, I think you'd get pretty much the performance of having a candidates cache (but only a few users will need to go that far).

Why?

Without file-notify, if the bibliography changes, you will have to wait for it to be parsed the next time you call completion. People might prefer to have Emacs re-parse it automatically (perhaps when it's idle) so there's no wait.

Can you also take a look at the new capf and make sure it works?

I modified citar-capf.el to account for the new API, and when I run (citar-capf) manually in a Markdown or Org buffer it seems to return reasonable results. But I still can't get the completion to work for some reason. I've added citar-capf to completion-at-point-functions, and completion does work normally in other buffers. I'll investigate this further when I have some time. In the meantime I've pushed the small changes needed.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 14, 2022

I think we should expose a few options that gradually make the results less "live" but improve performance. The idea would be to maintain the current architecture, but have the "preformat" stage do more and more work.

In any case, I do like the idea of incremental iteration (certainly not doing everything in this PR), in response to real user needs, but to have a clean foundation.

Do you agree this foundation is, indeed, cleaner, and easier to maintain going forward, than the current code?

So, for example, we currently adapt the completions to the current frame width. We could have a defcustom to specify a fixed width for completion candidates. Then the preformat can produce a single string of that width instead of a list of strings, and there's little work to be done at display-time.

An even more extreme option would make the preformat include the has:note and has:file markers, and they would only be updated when the bibliography changes or you run citar-refresh.

We'd have these options disabled by default, but document them in the Readme as "options for large bibliography files". What do you think? If you enabled all these options, I think you'd get pretty much the performance of having a candidates cache (but only a few users will need to go that far).

Do you imagine two boolean defcustoms? Or one with choices?

EDIT: or maybe one defcustom called something like citar-caching-strategy or some such, with the default what it is now?

Why?

Without file-notify, if the bibliography changes, you will have to wait for it to be parsed the next time you call completion. People might prefer to have Emacs re-parse it automatically (perhaps when it's idle) so there's no wait.

Right. But then would want to be when idle (or using the async library; though that would add a dependency).

Can you also take a look at the new capf and make sure it works?

I modified citar-capf.el to account for the new API, and when I run (citar-capf) manually in a Markdown or Org buffer it seems to return reasonable results. But I still can't get the completion to work for some reason. I've added citar-capf to completion-at-point-functions, and completion does work normally in other buffers. I'll investigate this further when I have some time. In the meantime I've pushed the small changes needed.

+1

@roshanshariff
Copy link
Collaborator Author

Do you agree this foundation is, indeed, cleaner, and easier to maintain going forward, than the current code?

Yes, I think all the future improvements I can think of would be easier to build upon this new architecture than what we have now.

As you say, we should respond to real user needs. If we decide that we need better performance on very large files, I'm not opposed to adding new caches if needed. But this should be a last resort.

Sometimes it's easy to avoid improving performance just by adding a cache on top, but the complexity of the system increases rapidly when you do this. And if we do end up adding another cache in the end, it's good if it can be invalidated and reconstructed quickly anyway :)

Do you imagine two boolean defcustoms? Or one with choices?

I'll have to think about this more, and I think the correct design will become apparent as we go. I need to do some profiling to figure out where the bottlenecks currently are so we're not optimizing the wrong thing.

citar.el Outdated Show resolved Hide resolved
@bdarcus
Copy link
Contributor

bdarcus commented Jun 15, 2022

I'll have to think about this more, and I think the correct design will become apparent as we go.

It might be the defcustom values are user-oriented.

Like the current behavior results in constantly updated related resource indicators, which sacrifices some ultimate performance and responsiveness, while the other optimizations you note sacrifice the former to optimize the latter.

EDIT: maybe a defcustom like citar-update-related-resources-ui, with values of with-bib-files (for large libraries) and always (default)? Something like that would be simple and clear for users, and flexible enough, but leave room for us to iterate, including internally.

@roshanshariff
Copy link
Collaborator Author

Actually, all this might be moot. With some optimization (a little hacky for now), I'm able to get it down to about 800 ms to format 7000 entries, with live has:notes and has:files indicators and adaptive column widths. And I think there are some more avenues for improvement.

Turns out ~70% of the time was being spent in garbage collection, not in string handling...

@andersjohansson
Copy link
Contributor

This sounds great! Just to add a user perspective, I currently always use a file exported from Zotero with my full library (ca 6500 entries), so keeping that fast is important for me. The initial parsing always takes 5-10 seconds or something (with json) which is of course unavoidable, but in case significant and slow work with all entries is done at other stages, I think the suggestion of configuring more "static" display could be good. I’ll gladly help testing when you think it is ready enough.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 15, 2022

The initial parsing always takes 5-10 seconds or something (with json) ...

Did you ever test JSON vs BibTeX?

My hypothesis is JSON would be faster, but I've never tested.

If it is, we might note that in the docs.

I’ll gladly help testing when you think it is ready enough.

I think once Roshan pushes that commit he's working on, it should be ready to test this piece.

@bdarcus
Copy link
Contributor

bdarcus commented Jun 15, 2022

Actually, all this might be moot. With some optimization (a little hacky for now), I'm able to get it down to about 800 ms ...

That's great. But wiill it be fast enough for citar-select-ref, in cases like @andersjohansson'?

Now that he confirms using a very large library (I wasn't sure how common it was), we should probably make him happy :-)

Turns out ~70% of the time was being spent in garbage collection, not in string handling...

And the upshot of that? We can tune to reduce the impact?

@bdarcus bdarcus force-pushed the cache--api-refactor branch from dcf79a1 to aa05050 Compare July 3, 2022 23:15
@roshanshariff
Copy link
Collaborator Author

roshanshariff commented Jul 4, 2022

I would be inclined to remove the file-notify functionality for now. The code is rather complex, and I can't think of a good way to have it work without locking out the UI. Without it, you get the parsing delay the first time you use a Citar command after the bibliography has changed, rather than every time it changes.

In the future I'd like to look into a way of doing the parsing in the background, but that is going to be challenging. A background emacs process would need to send the parsed data back as text, parsing which is probably just as slow as parse-bib. And apparently the Emacs multi-threading support is very immature and not suitable for serious use.

EDIT: Sorry, I'd missed your recent messages. Needless to say, I agree :)

bdarcus and others added 3 commits July 3, 2022 21:55
Bibliography files are now automatically re-parsed if their size or
modification time have changed and their hash has changed. Also re-parse if the
bibliography file has not changed but `citar--fields-to-parse` returns a
different list.

There should no longer be any need to manually refresh the cache, since calling
any citar command will do that if needed. Therefore, remove the
`citar-cache-refresh` command.

Some other small changes:
* Calculate the bibliography file hash only if its size and/or modification time
have changed.
* Update the preformatted strings if the format string has
changed, even if the bibliography hasn't.
@roshanshariff
Copy link
Collaborator Author

roshanshariff commented Jul 4, 2022

I've pushed a new commit which changes the caching behaviour. Now calling any citar command will refresh any files if and only if needed; see the commit message. I feel this is the right default for almost all users.

I've removed the citar-cache-refresh command since it shouldn't be needed any more. For development purposes (while working on the parsing code for example) it's probably more useful to just clear the cache with (clrhash citar-cache--bibliographies).

If there is demand, we can add a mode where Citar never updates files after first reading them, unless you manually request it. This could be useful for people who have large, frequently changing bibliographies and want Citar to intentionally use outdated information. But before doing this, I'd like to take a hard look at the performance of parsing and see what the bottlenecks are.

Finally, as a small help for benchmarking, Citar now prints a message whenever it's updating a bibliography, and prints the elapsed time. If you want to benchmark the candidate formatting, which is the other performance-sensitive feature, use

(citar-get-bibliographies) ;; make sure bibliographies are in cache
(benchmark-run-compiled 10 (citar--format-candidates)) ;; Remember to divide the elapsed time by 10

README.org Outdated Show resolved Hide resolved
Also add convenience functions for adding and removing notes backends.
@bdarcus bdarcus force-pushed the cache--api-refactor branch from eb368b8 to 11f0fbe Compare July 4, 2022 09:04
@bdarcus bdarcus added bug Something isn't working documentation Improvements or additions to documentation breaking change Includes changes that will break current code. labels Jul 4, 2022
@bdarcus bdarcus force-pushed the cache--api-refactor branch from 11f0fbe to b04dd2b Compare July 4, 2022 12:47
Comment on lines +332 to +337
(defun citar-file-has-notes (&optional _entries)
"Return predicate testing whether cite key has associated notes."
;; REVIEW why this optional arg when not needed?
(let ((files (citar-file--get-notes-hash)))
(lambda (key)
(gethash key files))))
Copy link
Contributor

@bdarcus bdarcus Jul 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roshanshariff can you review this little thing on citar-has-note?

See my comment here.

I didn't need entries here, and I don't need them in the org-roam one.

Or maybe I'm misunderstanding how this should work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do a more full code review later this week when I have some time, but the main reason is for commonality with the file-related functions. There it's useful to have the entries argument to check if an entry has the file field.

On the other hand, with our current notes backends, it is true that you only need the cite key. But I can imagine adding support in the future for the BibTeX note field, so maybe it's a good idea to leave the API as-is and just ignore the entries argument if your backend doesn't need it,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can think on it a bit.

But if a function needs it, it can pull the entries; no?

E.g. if I understand right, doesn't seem justified to design for a hypothetical need when we have two concrete backends neither of which need it.

Copy link
Collaborator Author

@roshanshariff roshanshariff Jul 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is a matter of taste to some extent. The code that's calling has-notes has a set of entries that it is operating on. To me, it seems cleaner to pass that on to the backend, which can either use it or ignore it. The alternative is for the backend to call back into Citar to get the entries, but this is not trivial. It involves asking the major-mode function for the list of bibliographies, getting them from the cache, merging their entries into another big hash table (which can use a lot of memory for no reason), etc. This is the reason some other API functions accept an optional :entries keyword argument, so they can avoid calling citar-get-entries if the entries hash table is already available.

And finally, it's easier to unit test a function just by passing it a hash table, knowing that its behaviour only depends on the arguments you pass in (rather than having to set up an environment where citar-get-entries returns the right thing).

In general, I prefer functions that take arguments for the information they need and return results that depend only on those arguments rather than global state. Of course, global state is unavoidable, but I think it's better for testability, reasoning, etc. if access to it is not scattered throughout the code base but carefully controlled to a few places.

EDIT: Also, if the file has changed on disk between the first time citar-get-entries was called by citar--format-candidates and when the backends call it again, then they could end up using different sources of information (not to mention parsing the bibliography files twice or more). I know this is not very likely, but even the theoretical possibility of race conditions like this makes the software engineer part of me very uncomfortable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. In any case, I'll defer to you; just thought I'd raise it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, and of course I might come around to a different view later. But also see my edit above about race conditions.

In general, accessing the cache is not "free", especially now that it can unpredictably decide to re-parse files. In general, the correct way to use it is to get everything you need from the cache at the beginning of some operation (i.e. call citar-get-entries in this case), use that information throughout, and then throw it away only when you're done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other news, I fixed at least one obvious bug (that was causing a compilation error noted in the MELPA review) in the capf. Perhaps it works now?

I can't easily test it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do a more full code review later this week when I have some time ...

I'm not sure if it's useful, but I just added this branch.

You should have write access, if you need/want it.

@bdarcus bdarcus force-pushed the cache--api-refactor branch from b04dd2b to a3bee3c Compare July 4, 2022 13:41
@bdarcus bdarcus merged commit ce51325 into emacs-citar:main Jul 4, 2022
@bdarcus
Copy link
Contributor

bdarcus commented Jul 4, 2022

Crossing-fingers, and .... merged!

I also submitted the citar-embark recipe to MELPA.

There are still a couple of little things; please take a look, including at my question above, and feel free to submit (nice and small) PRs!

@andersjohansson
Copy link
Contributor

andersjohansson commented Aug 8, 2022

Updated, and everything seems to work great!

Some feedback on real-world performance:

On my 6500 items library, with native json support:

Updating bibliography ~/bibliotek/bibliotek.json...done (3.700 seconds)
Updating bibliography ~/bibliotek/bibliotek.bib...done (10.567 seconds)

Running:

(benchmark-run-compiled 10 (citar--format-candidates))

Gives: (7.243678976 39 6.092884648). .7 seconds doesn’t seem too bad.

Thanks to the updates of org-cite-csl-activate, @andras-simonyi, one parsing step is cut out from my workflow as well.

I guess people on older computers may perhaps need to take action.

Nevertheless, my switch from citing in org-mode with the help of org-zotxt which calls the really slow Zotero server, to citing with citar feels like a great update. Citing once the library is parsed is really quick!

@bdarcus
Copy link
Contributor

bdarcus commented Aug 8, 2022

@andersjohansson - we have identified some performance issues in #659, which should be addressed in parsebib and in citar "soon". But it only impacts initial parsing, or when the file changes.

@roshanshariff roshanshariff deleted the cache--api-refactor branch December 12, 2023 02:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Includes changes that will break current code. bug Something isn't working chore documentation Improvements or additions to documentation enhancement New feature or request refactor
Projects
None yet
5 participants